Unit 6: Text Data Processing 


• Sentiments: strong positive, weak positive, neutral, weak negative, strong negative,
problems


• Requests: customer requests


Other Extraction Rules 


• Public security relations and events, such as person-org, person-person, and travel events


• Enterprise events, such as mergers and acquisitions and executive job changes


Two methods of customizing entity extraction: 


1. Dictionaries support lists of known entities

   a. Formerly known as Name Catalog
   b. Standard form and variants will be matched and normalized
   c. Source can be an XLS spreadsheet, XML file, or table
   d. Package includes XSD with correct dictionary format
   e. Supported for all languages
   f. Dictionary supports multiple languages


2. Rules allow definition of custom patterns

   a. Pattern matching language based on regular expressions, enhanced with natural
      language operators
   b. Command line compilation
   c. Rule customization supported in all languages
   d. Packaged rule sets for Voice of Customer (sentiment, request), enterprise events,
      public security events
   e. Engage consulting or a partner for additional customization
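
To make the idea of a pattern-based rule concrete, the following is a minimal sketch written with plain Python regular expressions. It is not the product's rule language and does not use its natural language operators; the pattern, the function name extract_exec_changes, and the sample sentence are all hypothetical and chosen only to illustrate an "executive job change" style of enterprise event.

```python
import re

# Hypothetical pattern for an "executive job change" event.
# Plain Python regex only; the real rule language is richer than this.
EXEC_CHANGE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"         # capitalized first and last name
    r" (?:was|has been) (?:named|appointed)(?: as)? "   # trigger phrase
    r"(?P<title>CEO|CFO|CTO)"                           # small illustrative title list
    r" of (?P<org>[A-Z][A-Za-z]+)"                      # single-word organization name
)


def extract_exec_changes(text):
    """Return one dict (person, title, org) per match found in text."""
    return [m.groupdict() for m in EXEC_CHANGE.finditer(text)]


sample = "Jane Doe was appointed CEO of Acme. Sales grew last quarter."
events = extract_exec_changes(sample)
print(events)
```

A pattern like this only matches the exact surface forms it lists, which is one reason a rule language adds linguistic operators on top of regular expressions.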


Dictionaries can be used to customize extraction for a customer or domain. A dictionary is a 
list of entities that are always extracted if one of their forms appears in the input. A dictionary 
is stored in XML format and compiled into a binary representation for runtime processing. 


Dictionaries can be used for name variation management, disambiguation of unknown 
entities, and control of entity recognition. Individual entities can be referred to using multiple 
names. People naturally associate some names and use them interchangeably. 


You can help the extraction process understand these variations by specifying them in a
dictionary. Adding names to a dictionary improves the usefulness and accuracy of results,
because the analysis then recognizes that seemingly separate entities refer to the same
thing. Tips for managing interchangeable entities:


• Example: IBM, International Business Machines, and Big Blue all refer to the same
company. 


• Pick a standard form, such as IBM, and make all other forms a variant of this form.


• Post-extraction, all entities that have the standard form, IBM, can be grouped together
even if the input text used a different form. 
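
The grouping described in these tips can be sketched in a few lines of Python. This is an illustration of the standard form and variant idea only, not the product's dictionary format or runtime; the DICTIONARY structure and the function names build_lookup and group_by_standard_form are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical in-memory dictionary: standard form -> known variants.
DICTIONARY = {
    "IBM": ["International Business Machines", "Big Blue"],
}


def build_lookup(dictionary):
    """Map every form (standard or variant, case-insensitive) to its standard form."""
    lookup = {}
    for standard, variants in dictionary.items():
        for form in [standard, *variants]:
            lookup[form.casefold()] = standard
    return lookup


def group_by_standard_form(mentions, lookup):
    """Group extracted mentions under their standard form; unknown mentions are skipped."""
    groups = defaultdict(list)
    for mention in mentions:
        standard = lookup.get(mention.casefold())
        if standard is not None:
            groups[standard].append(mention)
    return dict(groups)


LOOKUP = build_lookup(DICTIONARY)
mentions = ["Big Blue", "IBM", "International Business Machines", "Acme"]
grouped = group_by_standard_form(mentions, LOOKUP)
print(grouped)
```

Here all three mentions of the company end up under the single key IBM, while "Acme" is left out because it has no dictionary entry.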